The data set explored here contains information on 6497 red and white variants of Vinho Verde wine. Vinho Verde is a denomination of controlled origin, which means that all wines in this data set come from the same region in northern Portugal. Nonetheless, the denomination is broad enough to contain high variety in production methods and styles.
The data came in two files: one for red wines, the other for white wines.
These data sets can, of course, be analyzed separately. However, I’m interested to see if there are significant differences between the red and white variants, both in physicochemical measures and in evaluated scores. It’s important to note that white wines are overrepresented in the data.
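The step of stacking the two files and counting each variant can be sketched roughly as follows. This is a minimal illustration in base R, not the original analysis code: the two small data frames are hypothetical stand-ins for the red and white files, and their values are placeholders.

```r
# Hypothetical stand-ins for the two source files (placeholder values).
reds   <- data.frame(alcohol = c(9.4, 9.8, 10.0), quality = c(5L, 5L, 7L))
whites <- data.frame(alcohol = c(8.8, 9.5, 10.1, 11.0),
                     quality = c(6L, 6L, 6L, 7L))

# Tag each set with its variant before stacking, so the label survives.
reds$variant   <- "red"
whites$variant <- "white"
wines <- rbind(reds, whites)
wines$variant <- factor(wines$variant, levels = c("red", "white"))

# Counts and shares per variant make any imbalance explicit.
variant_counts <- table(wines$variant)
variant_share  <- prop.table(variant_counts)
print(variant_counts)
```

With the real files, the two data frames would come from something like `read.csv()` instead of being typed inline.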
##
## red white
## 1599 4898
## 'data.frame': 6497 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ variant : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
## $ total.acidity : num 8.1 8.68 8.56 11.48 8.1 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
##
## alcohol quality variant total.acidity
## Min. : 8.00 Min. :3.000 red :1599 Min. : 4.110
## 1st Qu.: 9.50 1st Qu.:5.000 white:4898 1st Qu.: 6.710
## Median :10.30 Median :6.000 Median : 7.300
## Mean :10.49 Mean :5.818 Mean : 7.555
## 3rd Qu.:11.30 3rd Qu.:6.000 3rd Qu.: 8.050
## Max. :14.90 Max. :9.000 Max. :16.285
##
## quality.as.factor
## 3: 30
## 4: 216
## 5:2138
## 6:2836
## 7:1079
## 8: 193
## 9: 5
This will be a good visual reminder of how overrepresented white wines are in this data set.
I’ve plotted separate histograms for each variant because I’ve heard many times that red wines have higher alcohol levels. These plots seem to contradict that: if the distributions are different, it seems whites would be more alcoholic. But it might be an artifact of the imbalance in the data. This requires further investigation.
Sulphates are often said to be the culprit in wine-related headaches, and red wine is said to cause them more often and more severely. There don’t seem to be significant differences in the distributions, but I’ll investigate this a little more.
Density seems to be bimodal. The faceted plots indicate there is likely a difference in distributions for red and white wines.
All of the acidity variables seem to somewhat follow a normal distribution. There is one possible exception with volatile.acidity. There might still be some interesting relationship between these variables.
Interesting, I expected residual.sugar to inversely correlate strongly with alcohol since the sugar is converted to alcohol in the fermentation process. But that doesn’t seem likely anymore, due to how different the histograms look.
Also, dry wines are likely the ones with the least residual.sugar, so it looks like white wines are more likely to be sweet than red wines.
It seems total.sulfur.dioxide is bimodal. Like density, this might be a distinguishing feature for red vs white wines.
Quality is overwhelmingly 5 or 6 for both wines. The side-by-side plots seem to indicate that white wines are slightly better evaluated.
The data set consists of 6497 observations of physicochemical measures and subjective sensory evaluations. There is also a factor variable describing whether a given observation is from a red or white wine.
The variables are:
11 quantitative measures, like alcohol (in % by volume) or pH.
1 numerical sensory evaluation of quality, given as the median of at least 3 evaluations made by wine experts. The scale ranges from 0 (bad) to 10 (good).
1 indexing variable.
There are also 2 variables added:
1 categorical variable identifying red or white wine.
1 aggregate variable, total.acidity, created because this information is usually provided on labels.
Most variables seem to follow a bell-shaped distribution, though not necessarily a Gaussian one. The alcohol variable seems to have a more spread-out distribution, while residual.sugar appears to be right skewed (its mean of 5.443 sits well above its median of 3.000). Also, total.sulfur.dioxide looks bimodal in aggregate, but this seems to be explained by a difference in distributions for red and white wines.
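The direction of a skew can also be checked numerically rather than by eye. The sketch below is only an illustration: the sample is made up, and `skewness` is a small helper defined here, not a base R function. A positive value, together with a mean above the median, points to a longer right tail; a negative value points to a longer left tail.

```r
# Made-up sample with a long upper tail (placeholder values).
x <- c(1.2, 1.6, 1.8, 1.9, 2.0, 2.3, 2.6, 6.1, 14.5, 20.7)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_value        <- skewness(x)
mean_minus_median <- mean(x) - median(x)
```

The same helper could be applied to any numeric column of the wine data, for example residual.sugar.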
Surprisingly, so far, red and white wines seem to follow the same distribution on all of the investigated variables, with the exception of total.sulfur.dioxide and, possibly, alcohol and quality. All of these exceptions appear to be skewed in favor of larger values for the white variety. This goes against my intuition in the cases of alcohol and quality.
I’m interested in the quality score. More precisely, I’m interested to see how this variable relates to the physicochemical ones, especially those shown on labels.
I’m also interested in the variant variable. Specifically, I’m curious if the physicochemical variables could indicate subtle categorical differences in the variants. Regarding the variants, I’m also interested to see if there is a bias in the expert community in evaluating these wines.
Wine experts like to say the 4 most important features of wine are alcohol, tannin, sweetness and acid profile. The data set has a direct alcohol variable, so that looks promising. The sweetness feature is a function of the residual.sugar variable. Acid profile is trickier. There is the obvious pH variable, but flavor depends on which types of acid are present and in what concentration. The variables fixed.acidity, volatile.acidity, and citric.acid should help. Unfortunately, none of the variables are related to tannin as far as I could find. Tannins are supposed to be complex organic molecules and don’t seem to relate to any of the chemical quantities described in the data.
Yes, I created total.acidity. It is just the sum of 2 other variables, fixed.acidity and volatile.acidity, but the interest in this particular quantity is due to the fact that this is the information most universally shown on wine labels.
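As a small sketch of that derived variable, the snippet below rebuilds total.acidity for the first four rows printed in the structure listing above. The data frame name `wines` is an assumption made for illustration.

```r
# First four rows of the two source columns, as printed in the listing above.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28)
)

# total.acidity is the row-wise sum of the two acidity measures.
wines$total.acidity <- wines$fixed.acidity + wines$volatile.acidity
print(wines$total.acidity)
```

The result agrees with the total.acidity values shown in the listing (8.1, 8.68, 8.56, 11.48).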
Looks like the higher-quality wines are more alcoholic. Let’s see if a box plot looks better.
The median level of alcohol is clearly increasing with score. There are quite a few outliers with high alcohol level and quality score of 5, though.
Also, the distribution for alcohol now seems to be the same for red and white wines, so I’ll just plot a qq-plot.
The distributions are close enough to one another to be considered the same.
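For reference, a two-sample qq comparison of this kind can be sketched in base R as below. The samples are simulated stand-ins (drawn from the same normal distribution by construction), not the wine data, and `max_gap` is just one assumed way to summarize how far the matched quantiles drift apart.

```r
set.seed(42)
# Simulated stand-ins for alcohol by variant; sizes mimic the imbalance.
red_alcohol   <- rnorm(50,  mean = 10.5, sd = 1.2)
white_alcohol <- rnorm(150, mean = 10.5, sd = 1.2)

# qqplot() sorts both samples and interpolates the larger one down to
# the size of the smaller; plot.it = FALSE returns the matched quantiles
# without drawing anything.
qq <- qqplot(red_alcohol, white_alcohol, plot.it = FALSE)

# Largest vertical distance between matched quantiles.
max_gap <- max(abs(qq$x - qq$y))
```

Dropping `plot.it = FALSE` draws the plot; points lying close to the line y = x suggest the two samples share a distribution.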
The interquartile range seems to decrease with increasing quality in the aggregate, but to increase within each variant.
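The interquartile ranges behind a statement like this can be tabulated directly. The sketch below uses a tiny made-up data frame (placeholder values) and base R's `tapply()` with `IQR()`; it only shows the mechanics, not the actual figures from the wine data.

```r
# Made-up rows: two quality levels, two variants (placeholder values).
wines <- data.frame(
  quality = c(5, 5, 5, 5, 6, 6, 6, 6),
  variant = c("red", "red", "white", "white",
              "red", "red", "white", "white"),
  alcohol = c(9, 10, 11, 12, 10, 10.5, 11, 11.5)
)

# IQR of alcohol per quality score, variants pooled.
iqr_by_quality <- tapply(wines$alcohol, wines$quality, IQR)

# IQR of alcohol per quality score within each variant.
iqr_by_both <- tapply(wines$alcohol,
                      list(wines$quality, wines$variant), IQR)
```

Comparing the pooled vector with the columns of the two-way table shows whether the aggregate trend also holds within each variant.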
There also seems to be some difference in the acidity distributions of red and white wines. I’ll draw a qq-plot to find out.
Alright, that settles it. Red wines are definitely more acidic.
Now, getting back to quality, let’s see if citric.acid gives us some insight.
Now, that is interesting! The citric.acid variable seems to predict quality for red wines only.
In all of the above boxplots there is an omission, a white wine outlier with 65.8 g/L in residual sugar.
## quality variant residual.sugar alcohol density
## 4381 6 white 65.8 11.7 1.03898
We can observe that the median values go up and down with quality because of the white wines. Red wines have a very consistent median across all quality values, which was to be expected since most of their residual.sugar values lie in a very narrow band. Also, white wines are clearly sweeter than reds in their distributions; no qq-plot is required for this observation.
Quality seems to be inversely correlated with density. This relationship appears stronger for white wines. If memory serves me right, alcohol has a lower density than water, so this relationship might be redundant. Let’s see how related they are.
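One way to put a number on how related alcohol and density are is the Pearson correlation, overall and per variant. The sketch below runs on simulated data in which density is assumed to fall linearly with alcohol plus noise; both the coefficients and the data frame are illustrative assumptions, not estimates from the wine data.

```r
set.seed(7)
n <- 50
# Simulated stand-in data (assumed linear form, placeholder coefficients).
wines <- data.frame(
  variant = rep(c("red", "white"), each = n / 2),
  alcohol = runif(n, min = 8, max = 14)
)
wines$density <- 1.005 - 0.001 * wines$alcohol + rnorm(n, sd = 0.0005)

# Pearson correlation over all rows.
overall_cor <- cor(wines$alcohol, wines$density)

# The same correlation computed separately for each variant.
cor_by_variant <- sapply(split(wines, wines$variant),
                         function(d) cor(d$alcohol, d$density))
```

A strongly negative value would support the idea that density mostly repeats the information already carried by alcohol.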
In all of the density plots there is an omission: the same white wine outlier with 65.8 g/L in residual sugar, which has a density of 1.03898 g/cm³.
Let’s check if residual.sugar explains density well.
These are certainly 2 distributions plotted together. Most likely, this is a distinguishing feature for red and white wines. Let’s see by subsetting on wines with less than 20 g/L in residual.sugar.
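The subsetting step can be sketched as follows. The first four rows reuse values printed in the structure listing (all red, density as rounded there), and the fifth is the white outlier reported above; the data frame itself is only an illustrative stand-in for the full data.

```r
# Four rows from the structure listing plus the reported white outlier.
wines <- data.frame(
  variant        = c("red", "red", "red", "red", "white"),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 65.8),
  density        = c(0.998, 0.997, 0.997, 0.998, 1.03898)
)

# Keep only wines with less than 20 g/L of residual sugar.
trimmed <- subset(wines, residual.sugar < 20)

# Number of rows removed by the filter.
dropped <- nrow(wines) - nrow(trimmed)
```

Here the filter removes exactly one row, the 65.8 g/L outlier.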
Interesting, white wines seem to have 2 distinct distributions with respect to density. One of these behaviors seems to follow closely what happens with reds, but at lower densities.
Much like citric.acid, median sulphates values increase with quality. Let’s see if there is some interesting relationship between them.
Disappointing! I’ll investigate in more detail later.
Experts advocate for a balance between alcohol and acidity. From that, I would expect alcohol and total.acidity to show some sort of correlation. The fact that they don’t might explain why most wines are not well evaluated.
As for the bivariate distributions, red wines have more variance in both acidity measures.
White wines are contained in narrow bands in the acidity categories. For red wines, citric.acid and fixed.acidity are well correlated.
Overall, only alcohol and density seem to have any significant relationship with quality. However, these 2 variables are significantly correlated with each other. These overall observations seem to be the only ones applicable to white wines, which are overrepresented in the data set. It is worth mentioning that the relationship between density and quality is stronger for white wines. Every other interesting relationship with quality applies well enough only to red wines: those between quality and alcohol, sulphates and citric.acid.
As for distinguishing the variant variable, in addition to sulphates and citric.acid, residual.sugar is likely distributed differently between variants.
Yes. The variables density and residual.sugar seem to have distinct distributions for red and white wines. Moreover, there seem to be 2 distinct relationships between these variables within white wines. This indicates there is a hidden categorical variable in the white wines category.
Overall, the relationship of alcohol and quality was the strongest. When restricting to red wines, I would say citric.acid and quality showed the strongest relationship.
Let’s start with the curious relationship between density and residual.sugar. It seems that, for white wines at a fixed density, an increase in residual.sugar is related to an increase in quality.
Let’s say wines with less than 3 g/L in residual.sugar are dry wines and the rest are sweet wines. Let’s check that we can distinguish red from white dry wines.
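The dry/sweet split described above can be sketched like this. The ten residual.sugar values are the ones printed in the structure listing (all from red wines); the column name `sweetness` is an assumption made for illustration.

```r
# First ten residual.sugar values from the structure listing (g/L).
wines <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
)

# Below 3 g/L counts as dry, everything else as sweet.
wines$sweetness <- factor(
  ifelse(wines$residual.sugar < 3, "dry", "sweet"),
  levels = c("dry", "sweet")
)

sweetness_counts <- table(wines$sweetness)
print(sweetness_counts)
```

Of these ten wines, nine fall under the 3 g/L threshold and one (6.1 g/L) is labeled sweet.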
Interesting, the relationship with quality observed for white wines seems to exist for the dry wines as well.
When restricting to sweet wines, this low density higher residual.sugar relationship looks stronger.
Notice that, in the above plots, the y-axis measures are different linear combinations of the density and residual.sugar.
The signals found to discriminate quality seem to work well together (linearly). To build a predictive model for quality, I would recommend a hybrid approach. The first step would be to use the bimodal distributions, like residual.sugar, to create a dry-red, dry-white or sweet classifier. Next would come a different model for each category, which should benefit from dimensionality reduction (PCA looks good, as the relations seem linear).
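A rough sketch of that two-step idea follows. Everything here is an assumption made for illustration: the data are simulated, the grouping rule is a simple 3 g/L threshold rather than a fitted classifier, and the feature set is limited to three columns. PCA is done with base R's `prcomp()`.

```r
set.seed(1)
n_red <- 30; n_dry_white <- 15; n_sweet_white <- 15
n <- n_red + n_dry_white + n_sweet_white

# Simulated stand-in data (placeholder ranges and coefficients).
wines <- data.frame(
  variant = rep(c("red", "white"),
                times = c(n_red, n_dry_white + n_sweet_white)),
  residual.sugar = c(runif(n_red + n_dry_white, 1, 2.9),
                     runif(n_sweet_white, 5, 15)),
  alcohol = runif(n, 8, 14)
)
wines$density <- 0.995 + 0.0004 * wines$residual.sugar -
  0.001 * (wines$alcohol - 10) + rnorm(n, sd = 0.0005)

# Step 1: assign each wine to dry-red, dry-white or sweet (threshold rule).
wines$group <- ifelse(wines$residual.sugar >= 3, "sweet",
                      paste0("dry-", wines$variant))

# Step 2: run a separate, standardized PCA within each group.
features <- c("residual.sugar", "alcohol", "density")
pca_by_group <- lapply(split(wines[, features], wines$group),
                       function(d) prcomp(d, center = TRUE, scale. = TRUE))
```

Each element of `pca_by_group` holds the component loadings (`$rotation`) and scores (`$x`) that a per-group quality model could then be built on.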
This is an example of other feature relations. We can observe that the originally observed tendency to distinguish quality is mostly contained in one direction.
In both of the features of interest, residual.sugar and density worked together to strengthen the signal of good vs. bad quality wines and distinguishing red vs. white variants.
Yes. There are 2 distinct behaviors in the relation between density and residual.sugar. I believe these distinct behaviors distinguish dry and sweet wines. The surprising relation is that, for dry wines of some fixed density, an increase in residual.sugar is associated with an increase in quality.
The relationship between Density and Residual Sugar appears to follow 2 distinct trends. This is likely a difference in the composition of dry vs. sweet wines.
The relationship between Density and Residual Sugar also appears to allow a linear model to make good predictions when classifying red vs. white wine variants.
For dry vs. sweet wines, different linear combinations of Density and Residual Sugar seem to be useful in predicting good wines. Moreover, for dry wines, red and white variants seem to have somewhat distinct characteristics for predicting good wines.
The Vinho Verde data set consists of 6497 observations of physicochemical measures and subjective sensory evaluations. The set is unbalanced, containing about 3 times as many white wines as red wines (4898 vs. 1599).
There are only 13 variables, which made it possible to evaluate most of them individually before looking for relationships between variables. In this first exploration, some variables showed interesting behaviors. Alcohol had the most spread-out distribution. Some variables looked bimodal, which made for interesting explorations of the potential differences between red and white wines. Density ended up being the most interesting of these variables.
In exploring quality and how it relates to other variables, alcohol was the clear signal of good wines in aggregate. We ended up finding that the unbalanced data were hiding some quality-identifying variables, as red wines had more variables relating to quality.
But, in the end, it became apparent there were 3 distinct categories of wine, for which the physicochemical variables held different relations to quality.
With the above realization in mind, future work aiming to create predictors for good wines should, as a first step, try to classify wines into one of 3 categories: red, dry-white or sweet-white. Density and residual sugar are useful variables both for telling red from white and for telling dry from sweet. Some other variables seem promising for identifying red wines among white ones, including total.sulfur.dioxide, total.acidity and sulphates. A different, specific model can then be made for each of the 3 categories. Alcohol is relevant for all of these models. For white wines (both dry and sweet), density and residual sugar seem to be the best variables to include in the model after alcohol. For red wines, on the other hand, other variables seem best, like sulphates or citric acid.
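To make the "one model per category" idea concrete, here is a minimal sketch that fits a separate linear model of quality on alcohol within each of the 3 categories. The data are simulated under an assumed linear relationship, and the category labels are taken as already known; a real pipeline would first have to predict them and would likely use more predictors.

```r
set.seed(3)
n <- 90
# Simulated stand-in data with known category labels (placeholder values).
wines <- data.frame(
  category = rep(c("red", "dry-white", "sweet-white"), each = n / 3),
  alcohol  = runif(n, 8, 14)
)
# Assumed toy relationship: quality rises with alcohol, plus noise.
wines$quality <- 2 + 0.35 * wines$alcohol + rnorm(n, sd = 0.5)

# One linear model per category.
fits <- lapply(split(wines, wines$category),
               function(d) lm(quality ~ alcohol, data = d))

# Alcohol slope estimated within each category.
alcohol_slopes <- sapply(fits, function(f) coef(f)[["alcohol"]])
```

Extending the formula (for example with density and residual.sugar for the white categories, or sulphates and citric.acid for reds) would follow the variable choices suggested above.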
The most relevant limitation of the data set is the Denomination of Origin. All of the wines in the data set are from the same region in Portugal, so the results are likely not generalizable. It was also somewhat troublesome to deal with the imbalance between the varieties. This was especially the case since red wines had more variables signalling good wines. When taking the aggregate data, these relationships were masked.